Make it easier for user to search for tags#542
#### Closes #231

Applying the `Needles and Haystack` algorithm to find and match a tag among the existing tags.

This only applies when the searched tag_name is more than 3 characters long and at least 80% of its letters are found, from left to right.

There are 3 levels of checking, stopping at the first hit:

- Check for an exact name (case insensitive): O(1) lookup from a dictionary `Dict[str, Tag]`
- Check for all tags that have a 100% match via the algorithm
- Check for all tags that have a >= 80% match

If there is more than one hit, they will be shown as suggestions.

In order to avoid the API being called multiple times, I've implemented a cache that only refreshes itself when there is a gap of more than 5 minutes since the last API call to get all tags. Editing / Adding / Deleting tags will also modify the cache directly.

##### What about other solutions like fuzzywuzzy?

fuzzywuzzy was considered, but in testing it gave much lower scores than expected.

Code used to test:

```py
from fuzzywuzzy import fuzz


def _fuzzy_search(search: str, target: str) -> bool:
    found = 0
    index = 0
    _search = search.lower().replace(' ', '')
    _target = target.lower().replace(' ', '')
    for letter in _search:
        index = _target.find(letter, index)
        if index == -1:
            break
        found += index > 0
    # return found / len(_search) * 100
    return (
        found / len(_search) * 100,
        fuzz.ratio(search, target),
        fuzz.partial_ratio(search, target)
    )


tests = (
    'this-is-gonna-be-fun',
    'this-too-will-be-fun'
)
for test in tests:
    print(test, '->', _fuzzy_search('this too fun', test))
```

Result from test:

```py
this-is-gonna-be-fun -> (30.0, 50, 50)
this-too-will-be-fun -> (90.0, 62, 58)
```
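As a rough illustration of the three lookup levels described above, here is a minimal sketch. It assumes the tags live in a dict keyed by lowercase name; `find_tags` and `in_order_score` are hypothetical names, and the scoring function is a simplified stand-in for the left-to-right letter matching, not the code in this PR.

```python
from typing import Dict, List


def in_order_score(search: str, target: str) -> float:
    """Percentage of the letters of `search` found in `target`, left to right (simplified stand-in)."""
    needle = search.lower().replace(' ', '')
    haystack = target.lower()
    if not needle:
        return 0.0
    found, pos = 0, 0
    for letter in needle:
        pos = haystack.find(letter, pos)
        if pos == -1:
            break
        found += 1
        pos += 1  # continue searching after the matched letter
    return found / len(needle) * 100


def find_tags(name: str, tags: Dict[str, str]) -> List[str]:
    """Return matching tag names, stopping at the first level that produces a hit."""
    key = name.lower()
    # Level 1: exact, case-insensitive dictionary lookup.
    if key in tags:
        return [key]
    # Fuzzy matching only applies to names longer than 3 characters.
    if len(name) <= 3:
        return []
    scores = {tag: in_order_score(name, tag) for tag in tags}
    # Level 2 (100%) and level 3 (>= 80%), tried in that order.
    for threshold in (100, 80):
        hits = [tag for tag, score in scores.items() if score >= threshold]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    demo = {"f-strings": "...", "pathlib": "...", "this-too-will-be-fun": "..."}
    print(find_tags("fstrings", demo))
```

A result list with more than one entry corresponds to the "suggestions" case.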
|
Looking at the fuzzy search, have you considered the built-in |
|
We actually discussed if it'd be better to do the fuzzy search server-side on the API. I haven't looked into it deeply but here are some relevant links: https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/search/ I'm not sure if it'd be better to do it server or client side. I think that if there is room for fuzzy search to be used in the future with other endpoints (new or existing), then it should be server side. Another factor would be to see how accurate the pg search features are for our needs here. |
I've taken a look at it, it proves to be quite useful to get the differences in ...

```py
s = difflib.SequenceMatcher(lambda x: x in ' -', search, target)
return (
    found / len(_search) * 100,
    ('fuzzy', fuzz.ratio(search, target), fuzz.partial_ratio(search, target)),
    ('difflib', tuple(map(lambda x: x * 100, (s.ratio(), s.real_quick_ratio(), s.quick_ratio()))))
)
# --------------------------------
this too fun & this-is-gonna-be-fun -> (30.0, ('fuzzy', 50, 50), ('difflib', (50.0, 75.0, 50.0)))
this too fun & this-too-will-be-fun -> (90.0, ('fuzzy', 62, 58), ('difflib', (62.5, 75.0, 62.5)))
```

I've thought about whether this should be done from the API or from the bot. I think having a cache on the bot will give better performance, especially if we do not modify tags from the site side and restrict tag modifications to the bot's commands only; then we can maintain a cache that's perfectly synced with the site. The Postgres search feature looks powerful too; I'll definitely want to see how it performs as well. |
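For anyone wanting to reproduce the `difflib` part of the comparison above on its own, a self-contained sketch might look like the following. The helper name `difflib_scores` is hypothetical; the `SequenceMatcher` call and the order of the three ratios mirror the excerpt in the comment.

```python
import difflib
from typing import Tuple


def difflib_scores(search: str, target: str) -> Tuple[float, float, float]:
    """Return (ratio, real_quick_ratio, quick_ratio) as percentages."""
    # The first argument is the "junk" predicate: spaces and hyphens are ignored
    # as anchors when looking for matching blocks.
    matcher = difflib.SequenceMatcher(lambda ch: ch in ' -', search, target)
    return (
        round(matcher.ratio() * 100, 1),
        round(matcher.real_quick_ratio() * 100, 1),
        round(matcher.quick_ratio() * 100, 1),
    )


if __name__ == "__main__":
    for tag in ('this-is-gonna-be-fun', 'this-too-will-be-fun'):
        print('this too fun', '&', tag, '->', difflib_scores('this too fun', tag))
```

Note that `real_quick_ratio()` depends only on the two string lengths, which is why it is identical for both test tags.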
|
Did we decide on an approach for this? Bot-side or server-side? I kinda like the idea of shipping a query off to the API and having postgres do its thing. |
|
Given #388 it's better to keep it client-side as it would eventually have to be client-side anyway. However, that issue is stale so I don't know if we still want to do that. If not, then I agree with doing it on the server-side. |
|
That's a good point; I'd forgotten about that. Let's ask our tag master, @fiskenslakt, what his current opinion on the matter is to make sure we get this thing moving again. |
|
I closed #388. The meta repo now contains markdown files of any of our tags for now for the public to read through and be able to submit PRs for adding or editing tags. At the moment the process of adding or editing the tags is done via bot command in-server or by using the site's admin page (mods+ now have full access to the tag admin page). There's improvements that can be done to make things easier and to automate/integrate the process, but I'm of the opinion that tags will continue to live on the database, be accessible via API and editable via web admin, and as such we should probably stick to doing fuzzymatching api-side. |
MarkKoz
left a comment
Since this is already done and there has been no progress re-implementing this on the site, I think it is best to get this PR merged for the time being.
- Changed the type of `self._last_fetch` to `float` and gave it the initial value of `0.0` instead of `None`.
- Assigned `time.time()` to `time_now` to avoid calling this function twice.
- Added `self._last_fetch = time_now` after the API call.
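The cache behaviour these changes refer to could be sketched roughly as follows. This is only an illustration under assumptions: `TagCache`, `fetch_all`, `set` and `delete` are hypothetical names, and the real cog talks to the site API rather than a plain callable. Only `_last_fetch`, `time_now` and the 5-minute gap come from the PR text.

```python
import time
from typing import Callable, Dict


class TagCache:
    """Keep a local copy of all tags and refresh it at most once every 5 minutes."""

    REFRESH_SECONDS = 5 * 60

    def __init__(self, fetch_all: Callable[[], Dict[str, dict]]) -> None:
        self._fetch_all = fetch_all  # hypothetical stand-in for the API call
        self._cache: Dict[str, dict] = {}
        # 0.0 instead of None: the first call is always "older than 5 minutes".
        self._last_fetch: float = 0.0

    def get_all(self) -> Dict[str, dict]:
        time_now = time.time()  # read the clock once and reuse the value
        if time_now - self._last_fetch > self.REFRESH_SECONDS:
            fetched = self._fetch_all()
            self._cache = {name.lower(): tag for name, tag in fetched.items()}
            self._last_fetch = time_now
        return self._cache

    def set(self, name: str, tag: dict) -> None:
        """Mirror an add or edit directly into the cache."""
        self._cache[name.lower()] = tag

    def delete(self, name: str) -> None:
        """Mirror a deletion directly into the cache."""
        self._cache.pop(name.lower(), None)
```

Starting `_last_fetch` at `0.0` keeps the comparison purely numeric, so no `None` check is needed before the subtraction.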
…ciency.
- Matching scores are now calculated once and stored in the dict `scores`.
- Allow `_get_suggestions()` to go through a list of score thresholds and return the first non-empty list of matching tags at or above the threshold. This avoids calling the function multiple times like before (for example, `self._get_suggestions(tag_name, 100) or self._get_suggestions(tag_name, 80)` calls this function twice, which is inefficient).
- Deleted commented line.
- Added `typing` module for more typehints.
MarkKoz
left a comment
For a tag named foo-bar, foobars will not match and neither will foo_bar. foobar does match. The tags command doesn't seem to like spaces in tag names - it will never match.
- Added a regex to remove non-alphabet ( `[^a-z]` with `re.IGNORECASE` )
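To illustrate what stripping non-alphabet characters does to the names from the review comment above, here is a small sketch. The helper name `normalise` is hypothetical; the pattern and flag are the ones named in the commit message.

```python
import re

# Same pattern and flag as in the commit message above.
NON_ALPHABET = re.compile(r"[^a-z]", re.IGNORECASE)


def normalise(name: str) -> str:
    """Drop everything that is not a letter and lowercase the rest."""
    return NON_ALPHABET.sub("", name).lower()


if __name__ == "__main__":
    for candidate in ("foo-bar", "foo_bar", "Foo Bar", "foobars", "pep8"):
        print(repr(candidate), "->", repr(normalise(candidate)))
```

After normalisation `foo-bar`, `foo_bar` and `Foo Bar` all compare equal to `foobar`, while `foobars` still differs by one letter and has to be caught by the threshold. One side effect of this pattern is that digits are removed too, so `pep8` becomes `pep`.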
… 60]
- Since it returns as soon as suggestions are found for a threshold, this gives a better reflection of what the bot thinks the user is searching for.
|
Interesting! I've added a regex to remove all non-alphabet, as well as increasing threshold from |
|
In some cases that still isn't working so well:
Also discovered an unrelated issue in which it can't handle DELETE or GET requests for tags with spaces in them (returns 404). Might be a URL encoding issue since the tag is part of the URL path. It can POST fine because the tag name is instead part of the JSON. |
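If the 404 does turn out to be a URL encoding problem, percent-encoding the tag name before placing it in the path would look roughly like this. This is a sketch only: `tag_url` and the base URL are hypothetical, and it does not claim to describe how the bot's API client actually builds its URLs.

```python
from urllib.parse import quote


def tag_url(base_url: str, tag_name: str) -> str:
    """Build a URL whose last path segment is the percent-encoded tag name."""
    # safe='' also encodes '/', so a tag name can never add extra path segments.
    return f"{base_url.rstrip('/')}/{quote(tag_name, safe='')}"


if __name__ == "__main__":
    # Hypothetical base URL, for illustration only.
    print(tag_url("https://example.invalid/bot/tags", "my tag"))
```

A POST is unaffected either way, since the tag name travels in the JSON body rather than in the path.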
|
Hmm, I've added another layer of complexity that forces the search to go from word to word. Here's the snippet I used to test:

```py
import re
from typing import Dict, List, Optional

# Flags must be combined with `|`; `&` would yield 0 and silently drop both flags.
REGEX_NON_ALPHABET = re.compile(r"[^a-z]", re.MULTILINE | re.IGNORECASE)

stuff = ['args-kwargs', 'ask', 'class', 'classmethod', 'codeblock', 'decorators', 'dictcomps', 'enumerate', 'except', 'exit()', 'f-strings', 'foo', 'functions-are-objects', 'global', 'if-name-main', 'indent', 'inline', 'iterate-dict', 'listcomps', 'mutable-default-args', 'names', 'no-dm',
         'off-topic', 'open', 'or-gotcha', 'param-arg', 'paste', 'pathlib', 'pep8', 'positional-keyword', 'precedence', 'quotes', 'relative-path', 'repl', 'return', 'round', 'scope', 'seek', 'self', 'star-imports', 'traceback', 'windows-path', 'with', 'xy-problem', 'ytdl', 'zen', 'zip', ]
_cache = dict(zip(stuff, stuff))


def _fuzzy_search(search: str, target: str) -> float:
    """A simple scoring algorithm based on how many letters are found / total, with order in mind."""
    current, index = 0, 0
    _search = REGEX_NON_ALPHABET.sub('', search.lower())
    _targets = iter(REGEX_NON_ALPHABET.split(target.lower()))
    _target = next(_targets)
    try:
        while True:
            # Consume matching letters within the current word of the target.
            while index < len(_target) and _search[current] == _target[index]:
                current += 1
                index += 1
            # Move on to the start of the next word.
            index, _target = 0, next(_targets)
    except (StopIteration, IndexError):
        # Either the target ran out of words or the whole search string was matched.
        pass
    return current / len(_search) * 100


def _get_suggestions(tag_name: str, thresholds: Optional[List[int]] = None) -> str:
    """Return a formatted string of suggested tags."""
    scores: Dict[str, float] = {
        tag_title: _fuzzy_search(tag_name, tag)
        for tag_title, tag in _cache.items()
    }
    thresholds = thresholds or [100, 90, 80, 70, 60]
    for threshold in thresholds:
        suggestions = [
            _cache[tag_title]
            for tag_title, matching_score in scores.items()
            if matching_score >= threshold
        ]
        if suggestions:
            return f"{repr(tag_name)} - {suggestions}"
    return f"{repr(tag_name)} not found"


print(_get_suggestions('fstring'))
print(_get_suggestions('fstrings'))
print(_get_suggestions('fstr'))
print(_get_suggestions('f-str'))
print(_get_suggestions('f-string'))
print(_get_suggestions('f-strings'))
print(_get_suggestions('asks'))
print(_get_suggestions('foos'))
print(_get_suggestions('dict'))
print(_get_suggestions('opens'))
print(_get_suggestions('or'))
print(_get_suggestions('or-g'))
print(_get_suggestions('or-'))
print(_get_suggestions('got'))
print(_get_suggestions('path'))
print(_get_suggestions('main'))
print(_get_suggestions('if'))
print(_get_suggestions('if main'))
print(_get_suggestions('asdfasdf'))
```

Here are the results:

```py
'fstring' - ['f-strings']
'fstrings' - ['f-strings']
'fstr' - ['f-strings']
'f-str' - ['f-strings']
'f-string' - ['f-strings']
'f-strings' - ['f-strings']
'asks' - ['ask']
'foos' - ['foo']
'dict' - ['dictcomps', 'iterate-dict']
'opens' - ['open']
'or' - ['or-gotcha']
'or-g' - ['or-gotcha']
'or-' - ['or-gotcha']
'got' - ['or-gotcha']
'path' - ['pathlib', 'relative-path', 'windows-path']
'main' - ['if-name-main']
'if' - ['if-name-main']
'if main' - ['if-name-main']
'asdfasdf' not found
```
|
- Added regex back to sub and split by non-alphabet.
- Now use two pointers to move from word to word.
MarkKoz
left a comment
That's working much better.